SwiftKey is a text-messaging application that predicts the next word as you type a message. When the user types "I went to the ", the application presents three options for what the next word might be, for example gym, store, or restaurant.
In this project, we use R text-mining tools to build a Shiny application that predicts the next word the user enters. We use data scraped from blogs, Twitter, and news sites to build a language model, and we predict the next word by computing the maximum likelihood estimate (MLE) under that model.
In a nutshell, here’s a summary of the data analysis performed in this report.
The Data Science Capstone Course provided text data from three sources: blogs, Twitter, and news. The table below shows the number of lines from each source.
| | filename | line_count |
|---|---|---|
| 1 | en_US.blogs.txt | 899288 |
| 2 | en_US.news.txt | 1010242 |
| 3 | en_US.twitter.txt | 2360148 |
| 4 | Total | 4269678 |
We use a text-mining framework within R to transform the unstructured text into the structured document-by-term matrix format required for analysis. The first step is to "clean" the text with a series of text-processing functions applied before tokenization.
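As a rough sketch of what the cleaning stage does (the report itself uses an R text-mining framework for this step; `clean_text` below is a hypothetical base-R stand-in, not the actual implementation):

```r
# Illustrative cleaning steps in base R: lowercase, strip punctuation
# and digits, and collapse runs of whitespace.
clean_text <- function(x) {
  x <- tolower(x)                   # normalize case
  x <- gsub("[[:punct:]]", " ", x)  # replace punctuation with spaces
  x <- gsub("[[:digit:]]", " ", x)  # replace digits with spaces
  x <- gsub("\\s+", " ", x)         # collapse whitespace
  trimws(x)
}

clean_text("I went to the Gym, at 9am!")
# "i went to the gym at am"
```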
Second, we tokenize the text into ngrams. For our analysis, a gram is a whitespace-delimited sequence of characters corresponding to an English word, and the N in ngram is the number of consecutive words treated as a single unit. For example: a 1-gram is "new", a 2-gram is "new york", a 3-gram is "new york city", and a 4-gram is "new york city police".
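A minimal base-R tokenizer illustrates the idea (`make_ngrams` is a hypothetical helper; the analysis performs this step inside the text-mining framework):

```r
# Slide a window of n words over the text and join each window
# into a single whitespace-delimited ngram.
make_ngrams <- function(text, n) {
  words <- strsplit(text, "\\s+")[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

make_ngrams("new york city police", 2)
# "new york"  "york city"  "city police"
```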
Finally, we build a document-by-term matrix by transforming the ngrams into a bag-of-words model. Each column in the matrix corresponds to a unique ngram; each row represents a document, and each cell holds the frequency with which that ngram appears in the document.
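A toy example of the bag-of-words layout in base R (the document names and tokens here are made up for illustration; the report builds its matrix with the text-mining framework):

```r
# Two "documents", each already tokenized into bigrams.
docs_tokens <- list(
  doc1 = c("new york", "york city", "new york"),
  doc2 = c("new york", "city police")
)

# Rows = documents, columns = unique ngrams, cells = frequencies.
vocab <- sort(unique(unlist(docs_tokens)))
dtm <- t(sapply(docs_tokens,
                function(toks) table(factor(toks, levels = vocab))))
dtm
```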
The 4,269,678 lines in the complete dataset can be memory intensive for the text-mining tools and slow the analysis. To speed things up, we subsample 1% of the complete dataset and work with the subsampled data for exploration and modeling. The subsampling implementation is in Appendix 1.
| source | num_lines | num_unique_words | mean_word_freq | median_word_freq |
|---|---|---|---|---|
| twitter | 23602 | 8040 | 20 | 9 |
| blogs | 8993 | 12414 | 15 | 6 |
| news | 10103 | 12850 | 15 | 7 |
The table above provides summary statistics for the sampled data. We see that the twitter data has fewer unique words, which makes sense given the 140-character limit. We also see that the mean word frequency is nearly twice the median frequency, which indicates a right-skewed distribution.
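The gap between mean and median is characteristic of a right-skewed distribution: a handful of very frequent words pulls the mean up while most words occur only a few times. A toy illustration with made-up counts:

```r
# Hypothetical word counts: mostly rare words plus one very common one.
freqs <- c(1, 1, 1, 2, 3, 5, 50)

mean(freqs)    # 9 - pulled up by the frequent word
median(freqs)  # 2 - unaffected by the outlier
```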
For example, look at the word frequency distribution for the sampled data:
p <- all_docs_word_plot(sample_vector_corpus)
print(p)
Let’s load all the data sources into one corpus.
docs <- load_sample_dircorpus()
docs <- preprocess_entries(docs)
Here are the top bigrams.
ngram_2 <- get_docterm_matrix(docs, 2)
p2 <- generate_word_frequency_plot(ngram_2$wf, "Top Bigrams for Sampled Text")
print(p2)
Here are the top trigrams.
ngram_3 <- get_docterm_matrix(docs, 3)
p3 <- generate_word_frequency_plot(ngram_3$wf, "Top Trigrams for Sampled Text")
print(p3)
Here are the top 4-grams.
ngram_4 <- get_docterm_matrix(docs, 4)
p4 <- generate_word_frequency_plot(ngram_4$wf, "Top 4-grams for Sampled Text")
print(p4)
We build a tree from the ngrams and compute the MLE using the Dirichlet-multinomial model. We use the data.tree package, which can build a tree from a data.frame. Now let’s perform a search for “data”.
Here are the maximum likelihood estimates. They show a 6% likelihood that “entry” will be the next word: “data entry” has a frequency of 12 and “data” has a frequency of 198, so the maximum likelihood estimate is 12/198 ≈ 6.1%.
results <- perform_search(ngram_tree, c("data"))
print(results)
## 12 10
## recommended_words "entry" "streams"
## likelihood "0.0606060606060606" "0.0505050505050505"
## 8 7
## recommended_words "recovery" "dating"
## likelihood "0.0404040404040404" "0.0353535353535354"
## 7
## recommended_words "personalize"
## likelihood "0.0353535353535354"
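The estimates above are simple ratios of ngram counts. A minimal sketch of the calculation (`mle` is a hypothetical helper; the report computes this inside the tree search):

```r
# MLE of P(word | prefix) = count(prefix followed by word) / count(prefix)
mle <- function(ngram_count, prefix_count) ngram_count / prefix_count

mle(12, 198)  # "data entry" given "data": ~0.0606
mle(10, 198)  # "data streams" given "data": ~0.0505
```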
Then, if we query for “data entry”, we search the tree for the node “data” and then “entry”, and we recommend the words “just” and “respond”.
results <- perform_search(ngram_tree, c("data", "entry"))
print(results)
## 6 6
## recommended_words "just" "respond"
## likelihood "0.5" "0.5"
This code collects a 1% sample using a weighted “coin flip” per line to decide which lines to keep.
# sample the datasci dir
sample_capstone_data <- function(fn, outfn, sample_len=0.01) {
print(sprintf("Reading %s", fn))
lines <- readLines(fn)
set.seed(123)
print(sprintf("Read %s Length %s", fn, length(lines)))
# keep each line with probability sample_len (a weighted coin flip per line)
lines_sample <- lines[as.logical(rbinom(length(lines), 1, sample_len))]
print(sprintf("Writing %s. Length %s", outfn, length(lines_sample)))
writeLines(lines_sample, con=outfn)  # plain text, one sampled line per line
}
sample_capstone_data("./data/final/en_US/en_US.twitter.txt",
"./data/final/en_US/sample/en_US.twitter.txt")
sample_capstone_data("./data/final/en_US/en_US.blogs.txt",
"./data/final/en_US/sample/en_US.blogs.txt")
sample_capstone_data("./data/final/en_US/en_US.news.txt",
"./data/final/en_US/sample/en_US.news.txt")